Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PG15] Feature/replicas #279

Merged
merged 2 commits into from
Apr 13, 2023
Merged

[PG15] Feature/replicas #279

merged 2 commits into from
Apr 13, 2023

Conversation

MMeent
Copy link

@MMeent MMeent commented Apr 11, 2023

MMeent added 2 commits April 11, 2023 13:36
Add condition variable for WAL recovery; allowing backends to wait for recovery up to some record pointer.
This fixes some test failures that showed up after updating Neon code to do
more precise handling of replica's get_page_at_lsn's request_lsn lsns.
@MMeent MMeent requested a review from knizhnik April 11, 2023 11:47
@MMeent MMeent merged commit aee72b7 into REL_15_STABLE_neon Apr 13, 2023
@MMeent MMeent deleted the feature/replicas-v15 branch April 13, 2023 20:42
MMeent added a commit that referenced this pull request May 11, 2023
* Recovery requirements:

Add condition variable for WAL recovery; allowing backends to wait for recovery up to some record pointer.

* Fix issues w.r.t. WAL when LwLsn is initiated and when recovery starts.
This fixes some test failures that showed up after updating Neon code to do
more precise handling of replica's get_page_at_lsn's request_lsn lsns.

---------

Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
tristan957 pushed a commit that referenced this pull request Aug 10, 2023
* Recovery requirements:

Add condition variable for WAL recovery; allowing backends to wait for recovery up to some record pointer.

* Fix issues w.r.t. WAL when LwLsn is initiated and when recovery starts.
This fixes some test failures that showed up after updating Neon code to do
more precise handling of replica's get_page_at_lsn's request_lsn lsns.

---------

Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
tristan957 pushed a commit that referenced this pull request Nov 8, 2023
* Recovery requirements:

Add condition variable for WAL recovery; allowing backends to wait for recovery up to some record pointer.

* Fix issues w.r.t. WAL when LwLsn is initiated and when recovery starts.
This fixes some test failures that showed up after updating Neon code to do
more precise handling of replica's get_page_at_lsn's request_lsn lsns.

---------

Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
tristan957 pushed a commit that referenced this pull request Nov 8, 2023
* Recovery requirements:

Add condition variable for WAL recovery; allowing backends to wait for recovery up to some record pointer.

* Fix issues w.r.t. WAL when LwLsn is initiated and when recovery starts.
This fixes some test failures that showed up after updating Neon code to do
more precise handling of replica's get_page_at_lsn's request_lsn lsns.

---------

Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
tristan957 pushed a commit that referenced this pull request Nov 8, 2023
* Recovery requirements:

Add condition variable for WAL recovery; allowing backends to wait for recovery up to some record pointer.

* Fix issues w.r.t. WAL when LwLsn is initiated and when recovery starts.
This fixes some test failures that showed up after updating Neon code to do
more precise handling of replica's get_page_at_lsn's request_lsn lsns.

---------

Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
tristan957 pushed a commit that referenced this pull request Feb 5, 2024
* Recovery requirements:

Add condition variable for WAL recovery; allowing backends to wait for recovery up to some record pointer.

* Fix issues w.r.t. WAL when LwLsn is initiated and when recovery starts.
This fixes some test failures that showed up after updating Neon code to do
more precise handling of replica's get_page_at_lsn's request_lsn lsns.

---------

Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
tristan957 pushed a commit that referenced this pull request Feb 5, 2024
* Recovery requirements:

Add condition variable for WAL recovery; allowing backends to wait for recovery up to some record pointer.

* Fix issues w.r.t. WAL when LwLsn is initiated and when recovery starts.
This fixes some test failures that showed up after updating Neon code to do
more precise handling of replica's get_page_at_lsn's request_lsn lsns.

---------

Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
tristan957 pushed a commit that referenced this pull request Feb 6, 2024
* Recovery requirements:

Add condition variable for WAL recovery; allowing backends to wait for recovery up to some record pointer.

* Fix issues w.r.t. WAL when LwLsn is initiated and when recovery starts.
This fixes some test failures that showed up after updating Neon code to do
more precise handling of replica's get_page_at_lsn's request_lsn lsns.

---------

Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
tristan957 pushed a commit that referenced this pull request May 10, 2024
* Recovery requirements:

Add condition variable for WAL recovery; allowing backends to wait for recovery up to some record pointer.

* Fix issues w.r.t. WAL when LwLsn is initiated and when recovery starts.
This fixes some test failures that showed up after updating Neon code to do
more precise handling of replica's get_page_at_lsn's request_lsn lsns.

---------

Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
tristan957 pushed a commit that referenced this pull request May 20, 2024
* Recovery requirements:

Add condition variable for WAL recovery; allowing backends to wait for recovery up to some record pointer.

* Fix issues w.r.t. WAL when LwLsn is initiated and when recovery starts.
This fixes some test failures that showed up after updating Neon code to do
more precise handling of replica's get_page_at_lsn's request_lsn lsns.

---------

Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
@alexanderlaw
Copy link

@MMeent , sorry for bumping the old issue, but I've stumbled upon a dubious thing introduced here.
Namely, this in XLogWaitForReplayOf(), e.g., in postgres-v15:

	timeout = ConditionVariableTimedSleep(&XLogRecoveryCtl->replayProgressCV,
										  10000000, /* 10 seconds */
										  WAIT_EVENT_RECOVERY_WAL_STREAM);

Does 10000000 really mean 10 seconds for that function?

I'm seeing inside that function:

            cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);

Maybe the value passed is incorrect?

(I have discovered this when noticed that that sleep lasts for more than 5 minutes.)

@MMeent
Copy link
Author

MMeent commented Nov 29, 2024

Yep, seems like it indeed is off by factor 1000.

@MMeent
Copy link
Author

MMeent commented Nov 29, 2024

I've created PRs for the issue for all PG versions and the Neon repo, so this won't remain an issue for much longer. Thank you for reporting this issue!

@alexanderlaw
Copy link

Thank you for paying attention to this!

github-merge-queue bot pushed a commit to neondatabase/neon that referenced this pull request Nov 29, 2024
The previous value assumed usec precision, while the timeout used is in
milliseconds, causing replica backends to wait for (potentially) many
hours for WAL replay without the expected progress reports in logs.

This fixes the issue.

Reported-By: Alexander Lakhin <exclusion@gmail.com>

## Problem


neondatabase/postgres#279 (comment)

The timeout value was configured with the assumption the indicated value
would be microseconds, where it's actually milliseconds. That causes the
backend to wait for much longer (2h46m40s) before it emits the "I'm
waiting for recovery" message. While we do have wait events configured
on this, it's not great to have stuck backends without clear logs, so
this fixes the timeout value in all our PostgreSQL branches.

## PG PRs

* PG14: neondatabase/postgres#542
* PG15: neondatabase/postgres#543
* PG16: neondatabase/postgres#544
* PG17: neondatabase/postgres#545
awarus pushed a commit to neondatabase/neon that referenced this pull request Dec 5, 2024
The previous value assumed usec precision, while the timeout used is in
milliseconds, causing replica backends to wait for (potentially) many
hours for WAL replay without the expected progress reports in logs.

This fixes the issue.

Reported-By: Alexander Lakhin <exclusion@gmail.com>

## Problem


neondatabase/postgres#279 (comment)

The timeout value was configured with the assumption the indicated value
would be microseconds, where it's actually milliseconds. That causes the
backend to wait for much longer (2h46m40s) before it emits the "I'm
waiting for recovery" message. While we do have wait events configured
on this, it's not great to have stuck backends without clear logs, so
this fixes the timeout value in all our PostgreSQL branches.

## PG PRs

* PG14: neondatabase/postgres#542
* PG15: neondatabase/postgres#543
* PG16: neondatabase/postgres#544
* PG17: neondatabase/postgres#545
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants